The sesameData package provides associated data for the sesame package. This includes example data for testing and instructional purposes, as well as probe annotation for different Infinium platforms.
Titles of all the available data can be shown with:
Each sesame datum from ExperimentHub is accessible through the sesameDataGet interface. All data must be pre-cached to local disk before they can be used. This design prevents conflicts in annotation data caching and removes the internet dependency. Caching needs to be done only once per sesame/sesameData installation. One can cache data using:
Once a data object is loaded, it is stored in a temporary cache, so that the data doesn’t need to be retrieved again the next time we call sesameDataGet. This design is meant to speed up the run time.
For example, the annotation for HM27 can be retrieved with the title:
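A minimal sketch of the listing and retrieval steps, assuming the sesameData package is installed and the data have already been cached. sesameDataList and sesameDataGet are the package's interfaces; the title "HM27.address" is an assumption about how the HM27 annotation is named in this release, so confirm it against the listing before relying on it.

```r
# Sketch only: the sesameData calls run only when the package is available.
title <- "HM27.address"  # assumed title; confirm against sesameDataList()
hm27 <- NULL

if (requireNamespace("sesameData", quietly = TRUE)) {
    hm27 <- tryCatch({
        # Show the first few available titles.
        print(head(sesameData::sesameDataList()))
        # Retrieval stops with an error if the data were never cached.
        sesameData::sesameDataGet(title)
    }, error = function(e) NULL)
}

if (is.null(hm27)) message("Data not available; install and cache first.")
```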
It’s worth noting that once a data object is retrieved through the sesameDataGet interface or the sesameData_getManifestDF interface (below), it stays in memory, so the next request for the same object returns immediately. This design avoids repeated disk/web retrieval. In rare situations, one may want to redo the download/disk IO, or empty the cache to save memory. This can be done with:
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 4525759 241.8 8203962 438.2 NA 7199088 384.5
## Vcells 8769739 67.0 15302957 116.8 16384 12145785 92.7
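The in-memory caching and reset behavior described above can be illustrated with a small base-R sketch. This is a toy model of the idea, not sesameData's actual implementation: cache_env, slow_load, cached_get and reset_cache are hypothetical names introduced only for this example.

```r
# Toy illustration of session-level caching; not sesameData's actual code.
cache_env <- new.env(parent = emptyenv())
n_loads <- 0L  # counts how often the slow retrieval path runs

slow_load <- function(title) {
    n_loads <<- n_loads + 1L
    paste("payload for", title)  # stand-in for disk/web retrieval
}

cached_get <- function(title) {
    # Retrieve from the slow source only if the object is not yet in memory.
    if (!exists(title, envir = cache_env, inherits = FALSE)) {
        assign(title, slow_load(title), envir = cache_env)
    }
    get(title, envir = cache_env, inherits = FALSE)
}

reset_cache <- function() {
    # Drop every cached object, then let R reclaim the memory.
    rm(list = ls(cache_env, all.names = TRUE), envir = cache_env)
    invisible(gc())
}

first  <- cached_get("example.title")  # slow path
second <- cached_get("example.title")  # served from memory
loads_before_reset <- n_loads
reset_cache()
third  <- cached_get("example.title")  # slow path again after the reset
```

The counter shows that the second request never touches the slow path, while a request after the reset does.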
Sesame provides some utility functions to process transcript models, which can be represented as data.frame, GRanges and GRangesList objects. For example, sesameData_getTxnGRanges calls sesameDataGet("genomeInfo.mm10")$txns to retrieve a transcript-centric GRangesList object from GENCODE, including gene annotation, exons and CDS (for protein-coding genes). It is then turned into a simple GRanges object of transcripts:
## GRanges object with 142604 ranges and 9 metadata columns:
## seqnames ranges strand | transcript_type
## <Rle> <IRanges> <Rle> | <character>
## ENSMUST00000193812.1 chr1 3073253-3074322 + | TEC
## ENSMUST00000082908.1 chr1 3102016-3102125 + | snRNA
## ENSMUST00000162897.1 chr1 3205901-3216344 - | processed_transcript
## ENSMUST00000159265.1 chr1 3206523-3215632 - | processed_transcript
## ENSMUST00000070533.4 chr1 3214482-3671498 - | protein_coding
## ... ... ... ... . ...
## ENSMUST00000082419.1 chrM 13552-14070 - | protein_coding
## ENSMUST00000082420.1 chrM 14071-14139 - | Mt_tRNA
## ENSMUST00000082421.1 chrM 14145-15288 + | protein_coding
## ENSMUST00000082422.1 chrM 15289-15355 + | Mt_tRNA
## ENSMUST00000082423.1 chrM 15356-15422 - | Mt_tRNA
## transcript_name gene_name gene_id
## <character> <character> <character>
## ENSMUST00000193812.1 4933401J01Rik-201 4933401J01Rik ENSMUSG00000102693.1
## ENSMUST00000082908.1 Gm26206-201 Gm26206 ENSMUSG00000064842.1
## ENSMUST00000162897.1 Xkr4-203 Xkr4 ENSMUSG00000051951.5
## ENSMUST00000159265.1 Xkr4-202 Xkr4 ENSMUSG00000051951.5
## ENSMUST00000070533.4 Xkr4-201 Xkr4 ENSMUSG00000051951.5
## ... ... ... ...
## ENSMUST00000082419.1 mt-Nd6-201 mt-Nd6 ENSMUSG00000064368.1
## ENSMUST00000082420.1 mt-Te-201 mt-Te ENSMUSG00000064369.1
## ENSMUST00000082421.1 mt-Cytb-201 mt-Cytb ENSMUSG00000064370.1
## ENSMUST00000082422.1 mt-Tt-201 mt-Tt ENSMUSG00000064371.1
## ENSMUST00000082423.1 mt-Tp-201 mt-Tp ENSMUSG00000064372.1
## gene_type source level cdsStart
## <character> <character> <character> <integer>
## ENSMUST00000193812.1 TEC HAVANA 2 <NA>
## ENSMUST00000082908.1 snRNA ENSEMBL 3 <NA>
## ENSMUST00000162897.1 protein_coding HAVANA 2 <NA>
## ENSMUST00000159265.1 protein_coding HAVANA 2 <NA>
## ENSMUST00000070533.4 protein_coding HAVANA 2 3670552
## ... ... ... ... ...
## ENSMUST00000082419.1 protein_coding ENSEMBL 3 13555
## ENSMUST00000082420.1 Mt_tRNA ENSEMBL 3 <NA>
## ENSMUST00000082421.1 protein_coding ENSEMBL 3 14145
## ENSMUST00000082422.1 Mt_tRNA ENSEMBL 3 <NA>
## ENSMUST00000082423.1 Mt_tRNA ENSEMBL 3 <NA>
## cdsEnd
## <integer>
## ENSMUST00000193812.1 <NA>
## ENSMUST00000082908.1 <NA>
## ENSMUST00000162897.1 <NA>
## ENSMUST00000159265.1 <NA>
## ENSMUST00000070533.4 3671348
## ... ...
## ENSMUST00000082419.1 14070
## ENSMUST00000082420.1 <NA>
## ENSMUST00000082421.1 15288
## ENSMUST00000082422.1 <NA>
## ENSMUST00000082423.1 <NA>
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
The returned GRanges object does not contain the exon coordinates. We can further collapse different transcripts of the same gene (isoforms) to gene level. Gene start is the minimum of all isoform starts and end is the maximum of all isoform ends.
## GRanges object with 55401 ranges and 2 metadata columns:
## seqnames ranges strand | gene_name
## <Rle> <IRanges> <Rle> | <character>
## ENSMUSG00000102693.1 chr1 3073253-3074322 + | 4933401J01Rik
## ENSMUSG00000064842.1 chr1 3102016-3102125 + | Gm26206
## ENSMUSG00000051951.5 chr1 3205901-3671498 - | Xkr4
## ENSMUSG00000102851.1 chr1 3252757-3253236 + | Gm18956
## ENSMUSG00000103377.1 chr1 3365731-3368549 - | Gm37180
## ... ... ... ... . ...
## ENSMUSG00000064368.1 chrM 13552-14070 - | mt-Nd6
## ENSMUSG00000064369.1 chrM 14071-14139 - | mt-Te
## ENSMUSG00000064370.1 chrM 14145-15288 + | mt-Cytb
## ENSMUSG00000064371.1 chrM 15289-15355 + | mt-Tt
## ENSMUSG00000064372.1 chrM 15356-15422 - | mt-Tp
## gene_type
## <character>
## ENSMUSG00000102693.1 TEC
## ENSMUSG00000064842.1 snRNA
## ENSMUSG00000051951.5 protein_coding
## ENSMUSG00000102851.1 processed_pseudogene
## ENSMUSG00000103377.1 TEC
## ... ...
## ENSMUSG00000064368.1 protein_coding
## ENSMUSG00000064369.1 Mt_tRNA
## ENSMUSG00000064370.1 protein_coding
## ENSMUSG00000064371.1 Mt_tRNA
## ENSMUSG00000064372.1 Mt_tRNA
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
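The gene-level collapsing rule (gene start is the minimum isoform start, gene end is the maximum isoform end) can be sketched in base R on a plain data frame. The coordinates are copied from the transcript output above; the data-frame layout is an assumption made for illustration and is not how the package stores or processes GRanges objects.

```r
# Four transcripts from the output above: one for 4933401J01Rik, three for Xkr4.
txns <- data.frame(
    gene_id = c("ENSMUSG00000102693.1", rep("ENSMUSG00000051951.5", 3)),
    start   = c(3073253, 3205901, 3206523, 3214482),
    end     = c(3074322, 3216344, 3215632, 3671498),
    stringsAsFactors = FALSE
)

# Collapse isoforms: minimum start and maximum end per gene.
gene_start <- tapply(txns$start, txns$gene_id, min)
gene_end   <- tapply(txns$end, txns$gene_id, max)

genes <- data.frame(
    gene_id = names(gene_start),
    start   = as.vector(gene_start),
    end     = as.vector(gene_end),
    stringsAsFactors = FALSE
)
genes <- genes[order(genes$start), ]  # order by genomic position
genes
```

For Xkr4 this gives 3205901-3671498, which agrees with the gene-level range printed above.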
Get nearby genes given probes
Sesame provides parsers to handle manifest files hosted on GitHub. One may choose to return a data frame, which includes all probes, or a GRanges object, which includes only mapped probes.
## GRanges object with 485545 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## cg13869341 chr1 15865-15866 -
## cg14008030 chr1 18827-18828 -
## cg12045430 chr1 29407-29408 -
## cg20826792 chr1 29425-29426 -
## cg00381604 chr1 29435-29436 -
## ... ... ... ...
## cg14273923 chrY 26409765-26409766 -
## cg05740793 chrM 1314-1315 -
## cg24159721 chrM 4620-4621 +
## cg17501828 chrM 6807-6808 +
## cg08858441 chrM 8878-8879 +
## -------
## seqinfo: 25 sequences from an unspecified genome; no seqlengths
Note that by default the GRanges object excludes decoy sequence probes (e.g., those on _alt and _random contigs). To include them, use the decoy = TRUE option.
## [1] 485545
## [1] 485569
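The effect of the decoy option can be pictured as a filter on sequence names. The sketch below is a base-R illustration under stated assumptions: the probe IDs are hypothetical, the contig names are used only as examples, and the suffix pattern covers just the _alt and _random cases named above, so the package's own rule may be broader.

```r
# Hypothetical probe table with primary and decoy contigs.
probes <- data.frame(
    probe_id = c("probe_a", "probe_b", "probe_c", "probe_d"),  # made-up IDs
    seqnames = c("chr1", "chr1_KI270706v1_random",
                 "chr5_GL339449v2_alt", "chrM"),
    stringsAsFactors = FALSE
)

# Flag contigs whose names end in _alt or _random.
is_decoy <- grepl("_(alt|random)$", probes$seqnames)

primary_only <- probes[!is_decoy, ]   # analogous to the default behavior
n_default    <- nrow(primary_only)
n_with_decoy <- nrow(probes)          # analogous to decoy = TRUE
```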
One can directly get probes from different parts of the genome.
library(GenomicRanges)  # provides GRanges(); attaches IRanges as a dependency
regs <- GRanges('chr5', IRanges(135313937, 135419936))
sesameData_getProbesByRegion(regs, platform = 'Mammal40')
## GRanges object with 10 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## cg18945109 chr5 135350775-135350776 +
## cg14826942 chr5 135350857-135350858 -
## cg14620903 chr5 135350865-135350866 -
## cg12825194 chr5 135350880-135350881 -
## cg10071034 chr5 135350884-135350885 -
## cg04472379 chr5 135350953-135350954 +
## cg25568354 chr5 135350962-135350963 +
## cg08401998 chr5 135351012-135351013 +
## cg23345269 chr5 135367261-135367262 +
## cg22464003 chr5 135369530-135369531 +
## -------
## seqinfo: 25 sequences from an unspecified genome; no seqlengths
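Conceptually, the region lookup above is an interval-overlap query: a probe is returned when it sits on the same chromosome as the region and its coordinates intersect it. A base-R sketch follows, using probe coordinates copied from the outputs in this document. The data-frame columns are an assumption for illustration, strand is ignored, and the package itself works on GRanges objects.

```r
# Three probes taken from the printed outputs: two on chr5, one on chr2.
probes <- data.frame(
    probe_id = c("cg18945109", "cg22464003", "cg11228575"),
    chrm     = c("chr5", "chr5", "chr2"),
    beg      = c(135350775, 135369530, 25228462),
    end      = c(135350776, 135369531, 25228463),
    stringsAsFactors = FALSE
)

# The query region used in the example above.
region <- list(chrm = "chr5", beg = 135313937, end = 135419936)

# Two closed intervals overlap when each starts no later than the other ends.
hits <- probes$chrm == region$chrm &
    probes$beg <= region$end &
    probes$end >= region$beg

in_region <- probes$probe_id[hits]
in_region
```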
## GRanges object with 1294 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## cg02171705 chrX 9463141-9463142 +
## cg01252899 chrX 9463189-9463190 +
## rs2521373_II_F_C_37521 chrX 9508950 -
## cg26545086 chrX 9944794-9944795 -
## cg14704094 chrX 10566874-10566875 -
## ... ... ... ...
## cg04337186 chrX 154031401-154031402 +
## cg16330204 chrX 154766280-154766281 +
## cg00547789 chrX 155264465-155264466 -
## cg10512285 chrX 155264469-155264470 -
## cg18230281 chrX 155264505-155264506 +
## -------
## seqinfo: 25 sequences from an unspecified genome; no seqlengths
## GRanges object with 36126 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## cg08067365 chr1 1013513-1013514 -
## cg13449535 chr1 1165968-1165969 +
## cg19945840 chr1 1232656-1232657 -
## cg13587552 chr1 1281061-1281062 -
## cg13354934 chr1 1360713-1360714 -
## ... ... ... ...
## cg17285325 chr22 50529914-50529915 -
## cg00083937 chr22 50601376-50601377 +
## cg00256932 chr22 50603303-50603304 +
## cg13194594 chr22 50679061-50679062 +
## cg19491113 chr22 50722054-50722055 -
## -------
## seqinfo: 25 sequences from an unspecified genome; no seqlengths
## GRanges object with 9 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## cg11228575 chr2 25228462-25228463 -
## cg16316743 chr2 25232953-25232954 -
## cg23393100 chr2 25234335-25234336 -
## cg20545546 chr2 25240304-25240305 -
## cg19346456 chr2 25240337-25240338 -
## cg09611799 chr2 25240401-25240402 -
## cg00206304 chr2 25241581-25241582 -
## cg15034063 chr2 25247679-25247680 -
## cg26550430 chr2 25274954-25274955 -
## -------
## seqinfo: 25 sequences from an unspecified genome; no seqlengths
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## cg00206304 chr2 25241581-25241582 -
## cg15034063 chr2 25247679-25247680 -
## -------
## seqinfo: 25 sequences from an unspecified genome; no seqlengths
## GRanges object with 1 range and 2 metadata columns:
## seqnames ranges strand | gene_name
## <Rle> <IRanges> <Rle> | <character>
## ENSG00000113648.16 chr5 135333900-135399914 - | MACROH2A1
## gene_type
## <character>
## ENSG00000113648.16 protein_coding
## -------
## seqinfo: 25 sequences from an unspecified genome; no seqlengths
Here we demonstrate downloading an .rds file from the annotation website.
## Retrieving annotation from https://github.com/zhou-lab/InfiniumAnnotationV1/raw/main/EPIC/EPIC.hg19.typeI_overlap_b151.rds... Done.
One can also download an annotation file to disk:
## $url
## [1] "https://github.com/zhou-lab/InfiniumAnnotationV1/raw/main/test/3999492009_R01C01_Grn.idat"
##
## $dest_dir
## [1] "/var/folders/hn/w06x7wkn2_v1c83rsm38qmcc2qh0vm/T//RtmpWOHSaS"
##
## $dest_file
## [1] "/var/folders/hn/w06x7wkn2_v1c83rsm38qmcc2qh0vm/T//RtmpWOHSaS/test/3999492009_R01C01_Grn.idat"
##
## $file_name
## [1] "test/3999492009_R01C01_Grn.idat"
From Bioconductor
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("sesameData")
The development version can be installed from GitHub.
## R Under development (unstable) (2021-11-09 r81170)
## Platform: x86_64-apple-darwin20.6.0 (64-bit)
## Running under: macOS Big Sur 11.6.1
##
## Matrix products: default
## BLAS: /Users/zhouw3/.Renv/versions/4.2.dev/lib/R/lib/libRblas.dylib
## LAPACK: /Users/zhouw3/.Renv/versions/4.2.dev/lib/R/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] GenomicRanges_1.47.6 GenomeInfoDb_1.31.4 IRanges_2.29.1
## [4] S4Vectors_0.33.10 sesameData_1.13.23 rmarkdown_2.11
## [7] ExperimentHub_2.3.5 AnnotationHub_3.3.8 BiocFileCache_2.3.4
## [10] dbplyr_2.1.1 BiocGenerics_0.41.2
##
## loaded via a namespace (and not attached):
## [1] Biobase_2.55.0 httr_1.4.2
## [3] sass_0.4.0 pkgload_1.2.4
## [5] vroom_1.5.7 bit64_4.0.5
## [7] jsonlite_1.7.3 bslib_0.3.1
## [9] shiny_1.7.1 assertthat_0.2.1
## [11] interactiveDisplayBase_1.33.0 BiocManager_1.30.16
## [13] blob_1.2.2 GenomeInfoDbData_1.2.7
## [15] yaml_2.2.2 remotes_2.4.2
## [17] sessioninfo_1.2.2 BiocVersion_3.15.0
## [19] pillar_1.7.0 RSQLite_2.2.9
## [21] glue_1.6.1 digest_0.6.29
## [23] XVector_0.35.0 promises_1.2.0.1
## [25] htmltools_0.5.2 httpuv_1.6.5
## [27] pkgconfig_2.0.3 devtools_2.4.3
## [29] zlibbioc_1.41.0 purrr_0.3.4
## [31] xtable_1.8-4 processx_3.5.2
## [33] later_1.3.0 tzdb_0.2.0
## [35] tibble_3.1.6 KEGGREST_1.35.0
## [37] generics_0.1.2 usethis_2.1.5
## [39] ellipsis_0.3.2 cachem_1.0.6
## [41] withr_2.4.3 cli_3.1.1
## [43] magrittr_2.0.2 crayon_1.4.2
## [45] mime_0.12 memoise_2.0.1
## [47] evaluate_0.14 ps_1.6.0
## [49] fs_1.5.2 fansi_1.0.2
## [51] pkgbuild_1.3.1 tools_4.2.0
## [53] prettyunits_1.1.1 hms_1.1.1
## [55] lifecycle_1.0.1 stringr_1.4.0
## [57] AnnotationDbi_1.57.1 callr_3.7.0
## [59] Biostrings_2.63.1 compiler_4.2.0
## [61] jquerylib_0.1.4 rlang_1.0.1
## [63] RCurl_1.98-1.5 rappdirs_0.3.3
## [65] bitops_1.0-7 testthat_3.1.1
## [67] DBI_1.1.2 curl_4.3.2
## [69] R6_2.5.1 knitr_1.37
## [71] dplyr_1.0.7 fastmap_1.1.0
## [73] bit_4.0.4 utf8_1.2.2
## [75] filelock_1.0.2 rprojroot_2.0.2
## [77] readr_2.1.2 desc_1.4.0
## [79] stringi_1.7.6 parallel_4.2.0
## [81] Rcpp_1.0.8 vctrs_0.3.8
## [83] png_0.1-7 tidyselect_1.1.1
## [85] xfun_0.29